ABSTRACT

Determining the effectiveness of a computer simulation model in 
duplicating a desired real world phenomenon is an important unsolved 
problem. The purpose of this paper is to model the validation pro- 
cedure in a broad context and develop a general methodology for the 
statistical part of validation. A procedure calling on utility, 
decision, simulation, and statistical theories is developed. The 
goals of statistical testing are presented, and the assumptions,
properties, and results of several parametric and nonparametric tests are
discussed and compared.

I. INTRODUCTION



With the advent of complex computer simulations of real world phe- 
nomena, a means of judging the worth or validity of the simulation has 
become very important, yet to associate with any simulation model a 
strict valid-invalid judgement is quite misleading. Models can be con- 
sidered valid under certain circumstances and invalid under others, or 
when compared using different criteria they may be considered first 
valid then invalid. 

In the past one of the largest problems of model validation has been 
the definition of the term. Normally what has been thought of as model 
validation is the statistical testing of collected data and the compari- 
son of the results with a predetermined test criterion. For this reason 
validation has come under attack for being a form of statistical chica- 
nery and merely a means to add credence to an already accepted model. 

The purpose of this paper is to redefine the validation procedure 
in a larger context and to describe some of the problems, techniques, 
and assumptions associated with the statistical portion of validation. 

It is hoped that with this procedure model validation will become a more 
definitive process and that the mystical air normally associated with 
statistical testing procedures will be removed.

II. VALIDATION PHILOSOPHY



The present procedure of validation is basically as follows. After 
the requirement to validate a model has been given, agreement on a sig- 
nificance level for statistical testing is reached. The data is then 
given to a statistician to find an appropriate testing procedure. Using 
the predetermined level of significance it is then determined whether the 
difference between the real world and model data is significant. The
decision-making procedure is quite simple: if the results of the test
indicate a significant difference, the model is said to be invalid;
if the difference is not significant, it is considered valid. This
procedure is shown in Figure 1 and is the one used in a recent
validation [12].*

This type of apparently straightforward validation has two basic
problems. The first is that the decision rule, while seemingly well
defined, is actually more complex and involves such things as cost and
utility models as well as statistical theory. As an example, in a
validation done by the Systems Analysis Group [23] the level of
significance was set at .5 in order to make the probability of accepting
an invalid model small. This decision must have involved consideration of
the costs of accepting an invalid model and of rejecting a valid model.
After compiling these costs, the principles of utility theory must have
been used in arriving at a figure of .5 as the best level of
significance. But none of these considerations were mentioned in the
report of the validation.

* Number in brackets is reference number as listed on pages 50 and 51.

FIGURE 1

A PRESENTLY USED MODEL VALIDATION SCHEME

So in the past, and even on present validation projects, the
decision rules are not fully explained, but should be if meaningful
validation is to be achieved. The second problem is that much confusion 
exists in the method of selection of a particular statistical test. In 
very few cases, if ever, will a test be perfect for the data. An assump- 
tion will therefore be relaxed slightly to make use of a strong property 
of a test, yet if another test were used that assumption might not have 
to be violated. The question of which assumptions and properties of a
test are most important is very complex and the answers are not clearly
defined. Thus the properties and goals of statistical tests must be
more completely defined. It must also be realized that while the passing
or failing of a single test, or at most of a few tests, constitutes a
decision rule now, the results of these tests should correspond to only
a single element of an n-dimensional decision vector.

The philosophy of the present validation procedure is sound. What is 
proposed is a new procedure, rather than philosophy, directed at allowing 
the decision maker more flexibility. This involves describing decisions 
in terms of utility, simulation, and decision theory as well as just 
collected data and statistics. 

The procedure can be thought of as the interchange of information
between three modules: Simulation Theory, Statistical Theory, and
Decision-Utility Theory. Within each module there are several nodes,
such as the testing node in the statistical module. Several nodes, such
as the criterion node, are shared between modules. In general the
procedure would work as follows. Information concerning the model and
real world, such as data, flows from the state of nature node through
the validation information node and into the decision node. From the
decision node several paths exist. The validation may be terminated by
accepting or rejecting the model; the model may be temporarily rejected
while further data is gathered or additional comparisons conducted; or
the decision may be to compare the information by means of a statistical
test. Regardless of the choice, if the validation is not terminated, more
information enters the validation information node as a result of the
decision, and the process will continue in a cycling manner until the
decision to terminate the validation is given. Figure 2 illustrates the
concept of modules and nodes interacting to form a validation procedure.

FIGURE 2

PROPOSED VALIDATION SCHEME

III. VALIDATION DECISIONS



There are four possible outcomes of any decision rule. These are
based on the two decisions:

    $D_0$ = Decision the model is valid
    $D_1$ = Decision the model is invalid

and the two possible underlying states of nature:

    $S_0$ = The model is valid
    $S_1$ = The model is invalid.

One possible outcome is $D_0 S_1$, the incorrect decision that the model is
valid when it is invalid. The other outcomes are $D_0 S_0$, $D_1 S_0$, and
$D_1 S_1$. Why the decision maker makes a particular decision depends upon
the decision rule he is using. As an example of a simple decision rule,
consider a validation procedure which is similar to the one presently
being used. It consists of one model, the two states of nature $S_0$ and
$S_1$, and the two decisions $D_0$ and $D_1$. The statistical test is exact,
that is:

    $\Pr(X^* = x_0 \mid S_0) = 1$

    $\Pr(X^* = x_1 \mid S_1) = 1$

or when the state of nature, $S$, is $S_0$ then the test statistic $X^*$ is
$x_0$, and similarly when $S = S_1$, then $X^* = x_1$. The simple decision
rule is: if $X^* = x_0$ then $D = D_0$, and if $X^* = x_1$ then $D = D_1$.
Again note that this is basically what is done in present validations. A
test is performed and according to the results of the test alone the
model is said to be valid or invalid.

Expanding the above, consider the following procedure consisting of
the same model and states of nature. The test is no longer exact. Now,
when $S = S_0$, $X^*$ is a random variable with density function $p_0(x)$, and
when $S = S_1$, $X^*$ is a random variable with density function $p_1(x)$.
$X^*$ represents the test statistic whose range is the real line $X$. Since
$X^*$ is now a random variable the decision rule may become more complex,
but the possible results are still the same.

FIGURE 3

RESULTS OF DECISION RULES AND THEIR PROBABILITIES

The probabilities of making the decisions are:

    $\alpha$ = Probability of rejecting a valid model
    $\beta$ = Probability of accepting an invalid model
    $1-\alpha$ = Probability of accepting a valid model
    $1-\beta$ = Probability of rejecting an invalid model.

As an example of another decision rule consider a case where

    $p_0(x)$ is normal $(0, \sigma^2)$
    $p_1(x)$ is normal $(\mu, \sigma^2)$, $\mu > 0$.

Let $X_0$ be that portion of the real line, $X$, such that all points in
$X_0$ are less than or equal to $X_c$, while the other points constitute
$X_1$; thus

    $X_0 = \{x : x \le X_c\}$

    $X_1 = \{x : x > X_c\}$

$X_c$ is an arbitrary point and can be determined either before or after
data is collected, depending upon the decision rule. If $X_c$ is
determined before the test is performed, then the decision rule might be
that if $X^*$ is in $X_1$ then $D_1$, otherwise $D_0$. With this decision rule
the corresponding decision probabilities are:



    $\Pr(D_0 \mid S_1) = \beta = \int_{X_0} p_1(x)\,dx$ = probability of accepting an invalid model

    $\Pr(D_1 \mid S_0) = \alpha = \int_{X_1} p_0(x)\,dx$ = probability of rejecting a valid model

    $\Pr(D_0 \mid S_0) = 1-\alpha = \int_{X_0} p_0(x)\,dx$ = probability of accepting a valid model

    $\Pr(D_1 \mid S_1) = 1-\beta = \int_{X_1} p_1(x)\,dx$ = probability of rejecting an invalid model
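The four probabilities can be evaluated numerically for the normal case. The following is a minimal sketch, assuming a cut point of 1.645, a shift of 3, and unit variance; these values and the function name are illustrative assumptions and do not come from any of the validations discussed.

```python
from statistics import NormalDist


def error_probabilities(x_cut, mu, sigma):
    """Return (alpha, beta) for the rule 'decide D1 when X* > x_cut'.

    Assumes p0 is normal(0, sigma^2) and p1 is normal(mu, sigma^2), mu > 0.
    """
    p0 = NormalDist(0.0, sigma)
    p1 = NormalDist(mu, sigma)
    alpha = 1.0 - p0.cdf(x_cut)   # mass of p0 above the cut: reject a valid model
    beta = p1.cdf(x_cut)          # mass of p1 at or below the cut: accept an invalid model
    return alpha, beta


# Hypothetical values chosen only for illustration.
alpha, beta = error_probabilities(x_cut=1.645, mu=3.0, sigma=1.0)
print(f"alpha = {alpha:.4f}, beta = {beta:.4f}")
print(f"1 - alpha = {1 - alpha:.4f}, 1 - beta = {1 - beta:.4f}")
```

Moving the cut point to the right lowers the first probability and raises the second, which is the trade-off the decision maker must weigh.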

Figure 4 shows the functions graphically with $X_c$ and the
probabilities $\alpha$ and $\beta$.

FIGURE 4

GRAPHIC DISPLAY OF POSSIBLE DISTRIBUTIONS OF TEST STATISTIC
ASSUMING $p_0(x) = n(0, \sigma^2)$ AND $p_1(x) = n(\mu, \sigma^2)$



It should be noted that Figure 4 represents a much simplified pair
of density functions. The nature of validation and the associated test
statistics often prevents any knowledge of the exact form of
$p_1(x)$. In only one of the tests performed in this paper can $\beta$ be
readily determined. Another complicating feature of validation is that
$p_1(x)$ usually flanks $p_0(x)$ or even overlaps $p_0(x)$ over its entire
domain. Figure 5 shows these possibilities.

FIGURE 5

POSSIBLE INTERACTIONS BETWEEN $p_0(x)$ AND $p_1(x)$



In both parts a) and b) of Figure 5 the previous decision rule could
have been used, but by associating costs with the various results of a
decision rule, a payoff matrix can be formed and another type of
decision rule employed.

Utility theory and cost structures will determine the values of the
decision matrix, but in general the $S_1 D_0$ element will represent the
highest cost, for it represents the acceptance of an invalid model, and
thus all subsequent decisions based on the assumption that the model is
valid will also be in error. $S_0 D_0$ will usually cost the least, but the
ordering of $S_1 D_1$ and $S_0 D_1$ will vary depending on the costs of
additional experimentation and realignment of the model.

With the costs of each decision result determined, and the decision
maker willing to accept a priori knowledge of $\Pr(S = S_0)$ and
$\Pr(S = S_1)$, he can, by using a rule such as minimax, choose a decision
which will minimize cost. If, through information received from the
statistical module, he is willing to accept values of $\alpha$ and $\beta$, then
the costs of the decisions will become expected costs, and by using the
same minimax decision rule the minimum expected cost can be found. Thus
with this decision rule, the relationship between the statistical and
decision modules can be seen as one in which additional information
received from the data is transmitted to the decision maker, allowing
him access to more information about the model and helping him to
refine his decision.
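How the error probabilities turn costs into expected costs can be sketched as follows. The payoff matrix, the prior probability, and the values of the two error probabilities are all hypothetical, and the comparison shown is a plain minimum expected cost over three candidate policies, which is only one of the decision rules that might be adopted.

```python
def expected_costs(cost, prior_valid, alpha, beta):
    """Expected cost of three policies given a payoff matrix.

    cost[(state, decision)] uses 0 for 'valid' and 1 for 'invalid',
    so cost[(1, 0)] is the cost of accepting an invalid model.
    """
    prior_invalid = 1.0 - prior_valid
    always_accept = prior_valid * cost[(0, 0)] + prior_invalid * cost[(1, 0)]
    always_reject = prior_valid * cost[(0, 1)] + prior_invalid * cost[(1, 1)]
    # Following the test: alpha and beta weight the outcomes within each state.
    follow_test = (
        prior_valid * ((1 - alpha) * cost[(0, 0)] + alpha * cost[(0, 1)])
        + prior_invalid * (beta * cost[(1, 0)] + (1 - beta) * cost[(1, 1)])
    )
    return {
        "always accept": always_accept,
        "always reject": always_reject,
        "follow test": follow_test,
    }


# Hypothetical costs: accepting an invalid model is the most expensive outcome.
cost = {(0, 0): 0.0, (0, 1): 4.0, (1, 0): 10.0, (1, 1): 2.0}
costs = expected_costs(cost, prior_valid=0.5, alpha=0.05, beta=0.10)
best = min(costs, key=costs.get)
print(costs, "->", best)
```

With these assumed numbers the statistical information lowers the expected cost below that of either unconditional decision, which is the sense in which the test result refines the decision.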

The choice of which decision rule to use is a subject in itself and 
is left for future study. Regardless of the method chosen though, the 
value of statistical information from the data is apparent. 

How to realign the simulation model, when the decision to reject it
is made, is the subject of simulation theory. This also is a complex
field and is left for future study.

IV. STATISTICAL THEORY MODULE AND TESTS



Two measures of the probability of rejecting a valid model are $\alpha$ and
P, where P is determined by finding the largest value of $\alpha$ for which
the null hypothesis that the model and real world are sampling from the
same distribution can be accepted given the test statistic. Thus P
represents the value of $\alpha$ at which the decision concerning the null
hypothesis passes from acceptance to rejection and is determined after
the test statistic is computed, whereas $\alpha$ is arbitrarily predetermined
and used to compare with the results of the test. When comparing
different tests, two approaches could be used. Either an $\alpha$ level of
significance could be predetermined and each test receive a pass or fail
rating, or a P value could be determined. For more sensitive comparisons
P values will be determined when applying data to tests in the following
sections.
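The relation between P and a predetermined significance level can be illustrated with a short sketch. It assumes, purely for illustration, that the null density of the test statistic is normal with mean 0 and that large values of the statistic form the critical region; the function names and the observed value are hypothetical.

```python
from statistics import NormalDist


def p_value_upper_tail(x_star, sigma=1.0):
    """P value of an observed statistic, assuming the null density is
    normal(0, sigma^2) and the critical region is the upper tail."""
    return 1.0 - NormalDist(0.0, sigma).cdf(x_star)


def passes(p_value, alpha):
    """Pass/fail rating: the null hypothesis is retained when P >= alpha."""
    return p_value >= alpha


p = p_value_upper_tail(1.0)   # hypothetical observed statistic
for alpha in (0.05, 0.109, 0.5):
    print(f"alpha = {alpha}: P = {p:.4f}, passes = {passes(p, alpha)}")
```

The same P value passes at the smaller levels and fails at .5, which is why reporting P itself gives a more sensitive comparison than a single pass or fail rating.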

The statistical module of model validation operates as follows.
Real world data is compared with model simulation data by one of many
statistical tests. For each test $p_0(x)$ is known and the value of
the test statistic is computed. Given $p_0(x)$ and the test statistic, the
P value is determined, along with $\beta$ if $p_1(x)$ is known. This
information is then passed into the test results node for further
transmission into the validation information node. If $\alpha$ was
predetermined and the test statistic $X^*$ fell into the critical region,
i.e., P was less than $\alpha$, then based on that particular test the
decision that the model be accepted cannot be endorsed. The decision to
reject or accept a model could be thought of as an n-dimensional vector
of which the test result is merely one component.

The types of statistical test previously used for model validation,
and most likely to continue being used, are both parametric and
nonparametric. Before describing several of these tests, a description of
their inherent differences should be useful.

A. PARAMETRIC AND NONPARAMETRIC TESTS 

A parametric statistical test is a test in which specific assumptions,
such as $\mu = 0$ or $\mu_1 = \mu_2$, about the parameters of the sampled
population are made, whereas a nonparametric test, as the name implies,
makes no assumptions about the value of the parameters in the sampled
population but rather assumes only that a distribution exists. Another
term often used interchangeably with nonparametric is distribution free.
A distribution free test differs from both the parametric and
nonparametric tests in that it makes no assumptions about the form of the
sampled distribution. In this paper the terms nonparametric and
distribution free will be used interchangeably. More important than the
difference in the definitions is the difference in the assumptions which
must be made when testing parametrically versus nonparametrically. To
determine critical values both tests require that the distribution of
the test statistic be fully known. In the case of the parametric tests
this often requires that the sample size be large so that the asymptotic
distribution of the test statistic is known. The distribution, $p_0(x)$,
of the test statistic in the nonparametric case is generally known
precisely and need not be assumed. Other assumptions of the parametric
tests may include independence of observations, underlying normal
distribution of the sampled populations, homoscedasticity or at least a
known ratio of variances among populations in the case of a multiple
sample test, and that the data is measured in at least an interval
scale, meaning that operations with the data are isomorphic to arithmetic.
The assumptions associated with nonparametric tests include only that
sampled populations be continuous and, in some cases, be symmetric or
identical. As with parametric tests, the observations are assumed to
be independent. For a more complete discussion of the assumptions see
[2, 21].

The more practical advantages of the nonparametric tests include 
their intuitive attraction, simplicity of derivation, and ability to 
be understood conceptually. They are often times easier to apply, but 
this quality deteriorates rapidly as the sample size increases past 30. 
Perhaps the largest advantage, however, is their statistical efficiency.* 
As Bradley explains [3]: 

When judged by the mathematical criterion of statistical effi- 
ciency, distribution-free tests are often superior to their 
most efficient parametric counterparts when both tests are 
applied under "nonparametric" conditions, i.e., conditions 
meeting all assumptions of the distribution-free test, but 
failing to meet some of the assumptions of the parametric 
test. When both tests are applied under "parametric" condi- 
tions, i.e., conditions meeting all assumptions of the para- 
metric test, and therefore of both tests, distribution-free 
tests are usually very slightly less efficient at small sample 
sizes, becoming increasingly less efficient as sample size 
increases.

Thus with large samples the parametric tests are more powerful provided 
that their assumptions are met. This margin of power enjoyed by the 
parametric tests decreases with sample size until the sample size becomes 
small enough, 6-10, that the power differential is insignificant. On the 
other hand, when the parametric assumptions are falsely made, but the
nonparametric assumptions are not, the nonparametric tests are often
superior.

* Power or statistical efficiency is defined as the ratio of the
parametric test sample size to the nonparametric test sample size required
to make the power of the two tests equivalent. If the power efficiency
of a nonparametric test is 96%, then if the more powerful parametric
test has 10 samples the nonparametric test need have only 10/.96 = 10.4
samples to be of equal power.

Since one of the underlying assumptions in validation is that the 
sample size of real world data will be quite small it is necessary to 
consider the effect of the parametric and nonparametric assumptions in 
terms of small samples, i.e., fewer than 10. Again according to Bradley,
it is with small samples that violations of the parametric assumptions
have their most drastic effect and, in addition, are least likely to be
detected. If a parametric test can be used, even though
it is more powerful, its advantage over the nonparametric test is slight
due to the small sample size.

Because of the many facets of both types of tests it would be foolish
to say that only tests of a single type should be used. Equally
foolish would be an attempt to categorize the types of data to be
validated with specific statistical tests. This choice remains in the
decision node of the validation procedure. So rather than attempt such
a recipe it is beneficial to look at what has been done in several
validations and what the differences in critical regions or P values are
when tests requiring various assumptions are performed on the same data.
In order to examine these differences, a sensitivity analysis on two
sets of data with respect to statistical tests and their inherent
assumptions was performed.

V. NONPARAMETRIC TESTS WITH SUBMARINE DATA BASE



The data used in the nonparametric tests of this section was obtained
from a sequence of submarine exercises in which a submarine attempted to
detect another submarine transiting through a defined region. If
detection was made, then both the range and aspect of the detected
submarine were recorded. A stern aspect indicates a retreating contact,
and the corresponding range is recorded as negative. A positive detection
range indicates a bow aspect at initial detection of an incoming
submarine. Thus for a given exercise the data might be: out of 10
possible detections, initial detection was made at 8, 3, -6, and 1 miles.
This should be interpreted as follows: the exercise was run 10 times
and detection occurred on 4 runs. On three of these runs the initial
detection was made when the transiting submarine was at a range of 8, 3,
and 1 miles and closing, while on the fourth run detection was made at 6
miles but the range was opening. These exercises were simulated on
the computer and similar results tabulated. Summarizing, the following
constitutes the data base for this validation. The submarine model was
tested under 10 different conditions such as speed and depth. Calling
each set of conditions an input, there are 10 distribution functions,
each of which corresponds to an input. For each of the inputs there
are several samples from the real world exercises and many, 100-120,
samples from the computer simulation model. Two measures of
effectiveness have been observed, namely the frequency of detection and
the range of initial detection.

Once again, the goal of the statistical module is to determine
$p_0(x)$ and $p_1(x)$, the size of the critical region or P value, and
$\beta$, where $p_0(x)$ is the distribution of the test statistic under the
null hypothesis that the real world and model are sampling from the same
distribution, and $p_1(x)$ is the distribution of the test statistic when
the two distributions are not the same.

A. TESTS 

The tests chosen to illustrate what might be done in validating the 
submarine model include the nonparametric Kolmogorov-Smirnov Test, the 
Fisher Exact Test, the Wilcoxon Test, and the tests used in the initial 
validation of this model [24]. 

B. KOLMOGOROV-SMIRNOV TEST 

Perhaps the most heuristic of the statistical tests is the
Kolmogorov-Smirnov Two Sample Test, often referred to as the Smirnov
Maximum Deviation Test. The test statistic is the maximum deviation
between the two empirical cumulative distribution functions.

To compute the test statistic, rank the n real world and m model
observations together and give each a subscript corresponding to its
rank. For each possible rank i, i = 1,...,n+m, calculate $d_i$, where:

    $d_i = \frac{r_i}{n} - \frac{s_i}{m}$

and

    $r_i$ = the number of real world observations less than or equal to
            the ith order statistic

    $s_i$ = the number of simulated observations less than or equal to
            the ith order statistic.

The test statistic, D, is $\max |d_i|$, i = 1,...,n+m. Under the hypothesis
that the observations came from the same distribution, the distribution
of D is known and can be calculated for any combination of n and m [4,20].
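The computation of D can be sketched directly from the two empirical step functions. This is a minimal illustration rather than the program used in the validation: it counts observations less than or equal to each pooled value, and the two small samples are hypothetical.

```python
def ks_two_sample_statistic(real, model):
    """Maximum absolute deviation between the two empirical cumulative
    distribution functions, evaluated at every pooled observation."""
    n, m = len(real), len(model)
    d_max = 0.0
    for x in sorted(set(real) | set(model)):
        r = sum(1 for v in real if v <= x)    # real world observations <= x
        s = sum(1 for v in model if v <= x)   # simulated observations <= x
        d_max = max(d_max, abs(r / n - s / m))
    return d_max


# Hypothetical detection ranges in miles, for illustration only.
real_ranges = [1, 5, 8]
model_ranges = [6, 10, 12, 14, 20]
D = ks_two_sample_statistic(real_ranges, model_ranges)
print(f"D = {D:.3f}")
```

The P value then follows from the null distribution of D for the given n and m, which this sketch does not attempt to reproduce.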
As an illustration of this test consider the 10th input, where n+m
is 106 and n = 3. The ranges of real world detections were ranked among
the model detections and the values of i are i = 7, 26, 38. The two step
functions are shown in Table I.

TABLE I

CUMULATIVE STEP FUNCTIONS IN KOLMOGOROV-SMIRNOV
TEST WITH INPUT 10 OF SUBMARINE DATA

D occurs at i = 38, where

    $d_{38} = 3/3 - 35/106 \approx .67$.

Using the approximation to $p_0(x)$, the P value is found to be .2024.
Table II gives a summary of the P values when the test was applied in a
similar fashion with the remaining 9 inputs.

This test has all the previously mentioned advantages of nonpara- 
metric statistics, especially intuitive appeal. It also has the 
advantage of testing for differences in the distributions caused by all 
the properties of the distribution function instead of just the dif- 
ferences in mean or variance. 

A major restriction is placed on the validity of the results by
using the approximation to $p_0(x)$. Hodges [15] has shown that as m and n
increase the approximation may differ significantly from $p_0(x)$; thus not
only does power efficiency decrease with increasing sample size, but the
approximation of $p_0(x)$ also becomes less valid. The effect on P of
approximating the distribution function can be shown by comparing the
results of this approximation with those of an exact test. This is
shown in Table XVII, page 45.



TABLE II

SUMMARY OF RESULTS FOR KOLMOGOROV-SMIRNOV
TEST WITH SUBMARINE MODEL DATA

    INPUT     P VALUE
      1        .485
      2        .260
      3        .980
      4        .998
      5        .941
      6        .491
      7        .922
      8        .577
      9        .792
     10        .202

C. WILCOXON TEST IN THE ORIGINAL VALIDATION

In the original validation of the submarine encounter model and the
associated data [26], the Wilcoxon Test was used with the initial range
of detection data. In this test the observations from both sources for
a given input are aggregated and ranked in order of magnitude. If both
model and real world are sampling from the same distribution then all
combinations of ranks are equally likely. The smallest and
largest possible rank sums make up the critical region since they
represent the least likely results. For example, suppose the real
world detections occurred at 1, 5, and 8 miles out of 6 runs, and out of
20 runs the simulation results had initial detection at 6, 10, 12, 14,
and 20 miles. The summed ranks for the real world would be 1+2+4=7, and
at an alpha level of .109 the difference between real world and model
results is significant.
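The worked example can be checked with a few lines of code. This is only a sketch: it assumes there are no tied observations, and the acceptance region 8-18 is the Table III entry for 5 model detections and 3 real world detections.

```python
def real_world_rank_sum(real, model):
    """Sum of the ranks held by the real world observations when both
    samples are pooled and ranked in order of magnitude (no ties assumed)."""
    pooled = sorted(real + model)
    rank_of = {value: rank for rank, value in enumerate(pooled, start=1)}
    return sum(rank_of[value] for value in real)


real = [1, 5, 8]               # real world detection ranges from the example
model = [6, 10, 12, 14, 20]    # model detection ranges from the example
w = real_world_rank_sum(real, model)

low, high = 8, 18              # acceptance region read from Table III
significant = not (low <= w <= high)
print(f"rank sum = {w}, significant = {significant}")
```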

Table III lists the acceptance regions in ranked sums for various
model and real world outputs. In each case the level of significance
is .109 and the real world ranks are to be summed. Exact P values were
not found in the original validation.

There are two difficulties with the original use of the test,
however. In an apparent attempt to avoid the tedious counting
procedures outlined in the next section, the model data was divided into
sets of 20, and each set was then tested with the set of real world data.
Each test was considered independent; thus the level of significance is
considered to be $1 - (1 - .109)^6$, or .5. This is derived by the
following argument. Since $\alpha$ was predetermined as $\alpha \ge .5$, then
$(1 - P_f)^n \le .5$, where $P_f$ is the probability of failing the test if
the state of nature is $S_0$, and n is the number of tests. If n = 6 then
$P_f$ becomes .109. It seems very difficult to believe, however, that
these tests are independent if the same real world observations are to
be used in each test. The second difficulty is the method in which the
rank sums were determined. Instead of considering an initial detection
of 8 miles differently than one of -8 miles, both were given the same
rank.
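The arithmetic behind the .109 and .5 figures can be reproduced as follows. The sketch takes the independence of the six tests for granted, which is exactly the assumption questioned above; the function names are illustrative.

```python
def overall_level(level, n_tests):
    """Probability that at least one of n independent tests fails when
    each has the given per-test significance level."""
    return 1.0 - (1.0 - level) ** n_tests


def per_test_level(overall, n_tests):
    """Per-test level that yields the requested overall level across
    n independent tests."""
    return 1.0 - (1.0 - overall) ** (1.0 / n_tests)


print(f"overall level for six tests at .109: {overall_level(0.109, 6):.4f}")
print(f"per-test level for overall .5:       {per_test_level(0.5, 6):.4f}")
```

If the tests share the same real world observations they are positively dependent, and the true overall level will generally differ from the value this calculation gives.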
TABLE III

RANK-SUM ACCEPTANCE REGIONS WITH $\alpha$ = .109*

    $R_1$ \ $R_2$     2         3         4         5         6
        2             -        7-12     11-18
        3            3-8       7-14     12-21
        4            3-10      8-17     12-23
        5            4-12      8-18     13-26
        6            4-13      9-21     14-29
        7            4-15     10-24     16-33     23-42
        8            5-17     10-26     16-35     24-46
        9            5-19     11-28     18-38     25-49
       10            5-20     12-31     19-41     27-53
       11            6-22     12-32     20-44     29-57     38-70
       12            6-24     13-35     21-47     30-60     40-74
       13            6-25     13-37     22-50     32-64     42-78
       14            7-27     15-40     23-53     33-67     43-82
       15            7-29     15-42     24-56     34-70     45-86
       16            8-31     16-45     25-59     36-73     47-90
       17            8-32     16-46     26-62     37-78     49-95
       18            8-34     17-49     28-65     39-82     51-99
       19            9-36       -       28-67     40-85     53-103
       20             -         -       30-71     41-88     55-107

    $R_1$ = No. of detections by model in sample size 20.
    $R_2$ = No. of detections in real world exercise, with 6 runs.

* By permission from Submarine ASW Encounter Simulation Model Detection
Validation (U) by Systems Analysis Office, ASW Systems Project Office
(1967).

Both these difficulties are corrected and a more exact test made in
the next section. Table IV presents a summary of results using the
original testing procedure.

TABLE IV

SUMMARY OF THE ORIGINAL WILCOXON TEST
RESULTS USING THE SUBMARINE DATA

    RUN NUMBER     P VALUE
         1         greater than .5
         2         greater than .5
         3         greater than .5
         4         greater than .5
         5         greater than .5
         6         greater than .5
         7         greater than .5
         8         less than .5
         9         greater than .5
        10         less than .5

D. EXACT WILCOXON RANK SUM TEST

An improvement over the original use of the Wilcoxon Rank-Sum Test
in validating this model can be made by not dividing the model data
into subsets but rather considering it as a sample of size 120, and by
determining the exact distribution of the Wilcoxon Test statistic. In
order to determine $p_0(x)$ it is necessary to compute
such things as the number of possible ways 4 numbers can be sampled
without replacement from the positive integers 1 through 124 such that
their sum is less than or equal to 165. A recursive counting
procedure for large numbers like 165 has been developed by Fix and
Hodges [10] and was used in determining P values for 4 of the 10 inputs.
As an approximation to the test it can be realized that the ranks form 
a finite population, thus the expected value and variance of the average 
rank sum can be determined exactly. The distribution of the average 
value of the observed ranks minus its expected value and divided by its 
standard deviation is approximately the unit normal. Kruskal and Wallis 
[16] have suggested an additional correction for continuity when using
this approximation. 
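Both the exact count and the normal approximation can be sketched briefly. The counting routine below is a simple dynamic program written for illustration and is not the Fix and Hodges procedure; the half-unit continuity correction is the usual textbook one and is an assumption here, not necessarily the correction suggested in [16].

```python
from math import comb, sqrt
from statistics import NormalDist


def count_subsets(n_total, k, max_sum):
    """Number of ways to choose k distinct ranks from 1..n_total whose
    sum is less than or equal to max_sum."""
    # ways[j][s] = number of j-element subsets of the ranks processed so far with sum s
    ways = [[0] * (max_sum + 1) for _ in range(k + 1)]
    ways[0][0] = 1
    for rank in range(1, n_total + 1):
        # Descending j ensures each rank is used at most once per subset.
        for j in range(min(k, rank), 0, -1):
            for s in range(max_sum, rank - 1, -1):
                ways[j][s] += ways[j - 1][s - rank]
    return sum(ways[k])


def exact_lower_tail(n_total, k, rank_sum):
    """Exact null probability that the rank sum is <= rank_sum."""
    return count_subsets(n_total, k, rank_sum) / comb(n_total, k)


def normal_lower_tail(n_total, k, rank_sum):
    """Normal approximation to the same probability, with a half-unit
    continuity correction (assumed)."""
    mean = k * (n_total + 1) / 2
    variance = k * (n_total - k) * (n_total + 1) / 12
    return NormalDist(mean, sqrt(variance)).cdf(rank_sum + 0.5)


# Small check: 3 real world ranks among 8 pooled observations, rank sum 7.
print(exact_lower_tail(8, 3, 7), normal_lower_tail(8, 3, 7))
# The larger problem described in the text: 4 ranks from 1..124, sum <= 165.
print(exact_lower_tail(124, 4, 165), normal_lower_tail(124, 4, 165))
```

The finite-population mean and variance used in the approximation are those of a sum of k ranks drawn without replacement from 1 through n_total.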

As an example of the effect of the normal approximation observe the 
summary of P values using the Wilcoxon Rank-Sum Test in Table V. 

These are the first results based on the exact distribution of 
Pq(x), but as shown by the results in Table V it is evident that the 
approximations are quite close. Again the inherent advantages of the 
nonparametric statistics are present but also is the lack of knowledge 
of p-|(x). For a more complete discussion of the efficiency of this 
test see [5]. 

E. BINOMIAL TEST IN THE ORIGINAL VALIDATION 

The other measure of effectiveness used in the statistical portion 
of the validation of this model is the probability of detection, P^. 

In the original validation [25] a very inexact test was used. If m 
model runs and n real world runs were to be compared then the probabili- 
ties of each possible outcome were estimated. These probabilities are 
shown in Table VII for n equal 4 and m equal 20. 

To see the inexactness of this test consider how the probabilities 
are determined. The probability of x detections in n runs is 

'"Ip^n-p )(n-x). 

lx; d a 
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TABLE V 



SUMMARY OF SUBMARINE MODEL P VALUES WITH 
DETECTION RANGE DATA USING THE WILCOXON RANK-SUM TEST 



INPUT 


EXACT 


APPROXIMATION 


1 




.3140 


2 


.0066 


.009322 


3 




.7900 


4 




.7860 


5 


.8735 


.8728 


6 


.2351 


.2448 


7 




.5666 


8 




.3600 


9 




.2684 


10 


.0952 


.0902 



TABLE VI

SUMMARY OF P VALUES USING NONPARAMETRIC
TESTS ON SUBMARINE MODEL RANGE OF DETECTION DATA

          K-S TEST    WILCOXON RANK-SUM TEST    ORIGINAL RANK-SUM TEST
INPUT     APPROX.     EXACT       APPROX.       APPROX.
  1        .485         -         .3140          > .5
  2        .260       .0066       .009322        > .5
  3        .980         -         .7900          > .5
  4        .998         -         .7860          > .5
  5        .941       .8735       .8728          > .5
  6        .491       .2351       .2448          > .5
  7        .922         -         .566           > .5
  8        .577         -         .3600          < .5
  9        .792         -         .2684          > .5
 10        .202       .0952       .0902          < .5

If the null hypothesis is true and the model and real world runs are
independent, then the probability of observing y out of m detections
from the model and x out of n detections from the real world is:

$$\Pr(x,y) = \binom{m}{y} P_d^{y} (1-P_d)^{m-y} \binom{n}{x} P_d^{x} (1-P_d)^{n-x}.$$

$\Pr(x,y)$ is a unimodal function of $P_d$, and by taking derivatives
with respect to $P_d$ it can be shown that if 0 < x+y < m+n its maximum
occurs at $P_d = (x+y)/(m+n)$, so that:

$$\Pr(x,y) \le \binom{m}{y}\binom{n}{x} \left(\frac{x+y}{m+n}\right)^{x+y} \left(1-\frac{x+y}{m+n}\right)^{(m+n)-(x+y)} = \hat{P}(x,y).$$

Thus $\hat{P}(x,y)$ is an upper bound on the probability that x and y
detections will occur. The values $\hat{P}(x,y)$ are listed in Table VII.

The data have again been divided into groups of 20 and thus the
critical region reduced to .109 for each test. For the particular values
of n and m the critical region has been partitioned in Table VII. Should
a pair (x,y) fall into this region for any of the six tests, then the
hypothesis is rejected at the .5 significance level.
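
The bound above is easy to tabulate. The following is a minimal sketch, not the original validation program: the function name `p_hat` is hypothetical, and the reading of the .109 figure as the per-test level that gives an overall level of .5 across six independent tests is an assumption that happens to reproduce the number quoted in the text.

```python
from math import comb

def p_hat(x, y, n=4, m=20):
    """Upper bound on Pr(x, y): the joint binomial probability evaluated
    at the pooled estimate P_d = (x + y) / (m + n)."""
    total, hits = m + n, x + y
    if hits == 0 or hits == total:
        return None  # the bound is derived only for 0 < x + y < m + n
    p = hits / total
    return comb(n, x) * comb(m, y) * p**hits * (1 - p) ** (total - hits)

# Assumption: six independent tests with an overall significance level of .5.
per_test_level = 1 - (1 - 0.5) ** (1 / 6)

# Full grid for n = 4 real world runs and m = 20 model runs.
table = {(x, y): p_hat(x, y) for x in range(5) for y in range(21)}
print(round(table[(0, 1)], 6), round(per_test_level, 3))
```

Evaluating the joint probability at the pooled proportion is what makes each entry an upper bound rather than an exact probability, which is the source of the objection raised below.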

The primary objection to the test, besides the division of model
observations into groups of 20, is the fact that each $\hat{P}(x,y)$ is
equal to or larger than its exact value, yet the size of the critical
region is still assumed to be .109. This would seem to indicate that
when the null hypothesis is accepted using this test, it might be
rejected when using a more exact test. This is in fact the case as
shown in Table XI, page 37.



F. FISHER EXACT TEST 

When using the number of detections divided by the number of runs
to test model validity, tests having more exact knowledge of $p_0(x)$
are also available. One such test is the Fisher Exact Test based on the
hypergeometric distribution. The P value is determined by computing

TABLE VII

$\hat{P}(x,y)$ VALUES FOR n = 4 AND m = 20*

y \ x    0         1         2         3         4
  0      -      .062623 | .006144   .000474   .000021
  1   .313114   .081919   .014194 | .001611   .000093
  2   .194556   .089891   .022945 | .003524   .000262
  3   .134835   .091778   .031708 | .006277   .000583
  4   .097514   .089873   .040012 | .009900   .001125
  5   .071870   .085359   .047517   .014391 | .001973
  6   .053350   .079194   .053965   .019721 | .003230
  7   .039598   .071953   .059163   .025835 | .005023
  8   .029231   .064093   .062972   .032647 | .007509
  9   .021365   .055975   .065294   .040045   .010883
 10   .015393   .047883   .066074   .047883   .015393
 11   .010883   .040045   .065294   .055975   .021365
 12   .007509 | .032647   .062972   .064093   .029231
 13   .005023 | .025835   .059163   .071953   .039598
 14   .003230 | .019721   .053965   .079194   .053350
 15   .001973 | .014391   .047517   .085359   .071870
 16   .001125   .009900 | .044012   .089837   .097514
 17   .000583   .006277 | .031708   .091778   .134835
 18   .000262   .003524 | .022945   .089891   .194556
 19   .000093   .001611 | .014194   .081919   .313114
 20   .000021   .000474   .006144 | .062623      -

NOTE: 1. Table entries represent the equation:

$$\hat{P}(x,y) = \binom{4}{x}\binom{20}{y} \left(\frac{x+y}{24}\right)^{x+y} \left(1-\frac{x+y}{24}\right)^{24-(x+y)}$$

      2. The | marks trace the boundary of the shaded (****) region.
         $\hat{P}(x,y)$ values lying between the two boundaries define
         the acceptance region.

* By permission from Submarine ASW Encounter Simulation Model Detection
Validation (U) by Systems Analysis Office, ASW Systems Project Office
(1967).

TABLE VIII

SUMMARY OF THE ORIGINAL BINOMIAL
TEST USING THE SUBMARINE DATA

RUN NUMBER     P VALUE
    1          greater than .5
    2          greater than .5
    3          greater than .5
    4          greater than .5
    5          greater than .5
    6          greater than .5
    7          greater than .5
    8          less than .5
    9          greater than .5
   10          less than .5

the probability of receiving the exact combination of model and real 
world detections as well as any of the more extreme combinations. 
Consider the data as presented in Table IX. 



TABLE IX

FISHER EXACT TABLEAU WITH
SUBMARINE DATA FROM INPUT 7

               NUMBER OF      NUMBER OF
               DETECTIONS     NON-DETECTIONS     TOTAL
REAL WORLD          3               5               8
MODEL              80              40             120
TOTAL              83              45             128



The probability of receiving this combination of detections and
non-detections is:

$$\frac{83!\,45!\,8!\,120!}{128!\,3!\,5!\,80!\,40!} = .07905$$



For a proof see [6]. 

The more unlikely combinations, keeping the totals fixed, and their
probabilities are listed in Table X.

TABLE X

MORE EXTREME TABLEAUS IN FISHER EXACT TEST FROM INPUT 7

                 DET.   NONDET.      DET.   NONDET.      DET.   NONDET.
REAL WORLD         2       6           1       7           0       8
MODEL             81      39          82      38          83      37
PROBABILITY        .01953              .002653             .0001518



The sum of all these probabilities is .10138; but this represents the
critical region in only one tail of $p_0(x)$, and since the alternate
hypothesis is compound the sum must be doubled. P is therefore .20277.
The results of this test with all 10 inputs are listed in Table XI.
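
The tail sum for input 7 can be reproduced with exact integer arithmetic. This is a minimal sketch under the fixed-margin (hypergeometric) model described above; the function name `table_probability` is hypothetical. Because the figures in the text appear to have been computed by hand, the values printed here should be expected to agree with them only to about two significant digits.

```python
from math import comb

def table_probability(k, real_total=8, model_total=120, detections=83):
    """Hypergeometric probability of k real world detections with all
    margins of the 2x2 tableau held fixed."""
    return (comb(real_total, k) * comb(model_total, detections - k)
            / comb(real_total + model_total, detections))

observed = 3
# Observed tableau plus the more extreme ones in the same tail (k = 3, 2, 1, 0).
one_tail = sum(table_probability(k) for k in range(observed + 1))
# Doubling rule used in the text for the two-sided alternative.
p_value = min(1.0, 2 * one_tail)
print(table_probability(observed), one_tail, p_value)
```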

These exact probabilities can be very tedious to compute and again
approximations are available. A normal approximation when the sample
size is large is described by Brownlee [7], along with guidelines on
when the approximation is valid. Unfortunately none of the input
results met the criterion, but in three cases they came reasonably close.
The results of using the approximations are listed in Table XI.

Along with the standard attributes of nonparametric statistics,
the Fisher Exact Test and its normal approximation both have well
defined power functions, $1-\beta$, associated with them. The power
$1-\beta$ for input 1 was computed using the methods suggested by
Brownlee [8] and is listed in Table XI as .529.

* Working in reverse it has been shown by Tocher [27] that the Fisher 
Exact Test can be used when the conditions of the normal approximations 
do not hold. 

TABLE XI

SUMMARY OF SUBMARINE MODEL P VALUES
USING THE FREQUENCY OF DETECTION DATA

          FISHER EXACT    NORMAL                      ORIGINAL
INPUT     TEST            APPROXIMATION    POWER      BINOMIAL
  1        .954            .928            .529        > .5
  2        .932              -               -         > .5
  3        .7772           .668              -         > .5
  4        .894              -               -         > .5
  5        .968              -               -         > .5
  6       1.000            .984              -         > .5
  7        .203              -               -         > .5
  8        .031              -               -         < .5
  9        .266              -               -         > .5
 10        .101              -               -         < .5



For information concerning the power function and the Fisher Exact Test
see [1,14,19]. Thus for the first time in the tests described, $p_1(x)$
and $p_0(x)$ can be found.

G. SUMMARY OF TESTS WITH SUBMARINE DATA 

This concludes a far from exhaustive presentation of possible
statistical tests which could be used in validating the submarine model.
Hopefully the types of assumptions that are necessary in nonparametric
testing are reasonably clear. Brownlee, Bradley, and Siegel give a far
more in-depth discussion of nonparametric statistics in their texts
referenced in this section. For a more complete discussion of the
power of nonparametric tests see [9] as well.

Before going on to a parametric test and one where dependence among
samples is considered, examine the difference in the critical regions
obtained by using various tests requiring slightly different assumptions
about the distribution of the test statistic (Tables VI and XI, pp. 32
and 37).

While care must be used in explaining the cause of the differences,
it is certainly safe to say that the assumptions of the Wilcoxon
Rank-Sum Test are most closely adhered to, while the ranking procedure
of the original rank sum test and the approximation of $p_0(x)$ in the
Kolmogorov-Smirnov Test would tend to discount the validity of their
results.

VI. PARAMETRIC AND NONPARAMETRIC TESTS WITH AIRCRAFT DATA BASE

The data to be used in the statistical tests of this chapter were
obtained from eight independent aircraft-submarine exercises. In each
exercise aircraft monitored a string of eight sonobuoys in an attempt
to gain and maintain the detection of a transiting submarine. All
exercises were made under similar conditions and therefore the
conditions can be considered identical. The measure of effectiveness in
these exercises is detection modulus or probability of detection.
Detection modulus, D.M., is computed by dividing the total number of
minutes detection was held by the total number of minutes detection
could have been held.

$$\text{D.M.} = \frac{\text{TIME DETECTION WAS HELD}}{\text{TIME DETECTION COULD HAVE BEEN HELD}}$$

For each of the eight runs the range and aspect of initial detection
for each buoy was tabulated. Also recorded were the range and aspect
at the time of losing contact, and the same information in the event
that contact was regained. Because of this extensive data base there
are several random variables which might be tested. All of these fall
into two categories, however: those in which the assumption is made
that the samples are independent and identically distributed, and those
which assume only that the samples are identically distributed. The
Paired t, Wilcoxon, and Kolmogorov-Smirnov Tests fall into the first
category while the Davisson Test falls into the latter.

Thus using detection modulus as a measure of effectiveness, the
nonparametric Wilcoxon and Kolmogorov-Smirnov Tests and the parametric
Paired t and Davisson Tests will be used to demonstrate the spectrum of
tests and their characteristics that might be used in validating a
model with this type of data base.

A. PAIRED t TEST 

Since we are testing the hypothesis that the real world and the model
are sampling from the same distribution it is only natural to compare the
differences in their outputs. Let $d_i$ represent the difference between
the real world and model detection moduli for run i, i = 1,...,n, and let

$$\bar{D} = \frac{1}{n} \sum_{i=1}^{n} d_i.$$

If $S^2$ is defined as:

$$S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (d_i - \bar{D})^2$$

then it can be shown [13] that

$$t = \frac{\sqrt{n}\,\bar{D}}{S}$$

is asymptotically t distributed with n-1 degrees of freedom.

Since this test assumes that the $d_i$ are independent, the sample size
is eight and each sample is the difference between the real world and
model estimate of the true detection modulus computed with all eight
buoys operating in concert. Table XII shows the actual data and part
of the calculations. The corresponding P value is approximately .42.

The Paired t Test has the advantages of the parametric tests, and
$p_0(x)$ is known exactly to be the t distribution with n-1 degrees of
freedom for large n. Since this distribution is well tabulated and the
arithmetic is basic, the test is easy to perform. The test does make
some very restrictive assumptions. The independence assumption forces
the aggregation of the data to the extent that much information may be
lost. The asymptotic property of the distribution of the test statistic
adds another degree of complexity, for no longer is $p_0(x)$ known
exactly. It is known only in the limit as n increases.



TABLE XII

INDEPENDENT AIRCRAFT DATA FOR PAIRED t TEST

RUN     REAL WORLD     MODEL
 i      D.M.           D.M.          d_i        |d_i - D|
 1      .4865          .3478         .1387       .1748
 2      .3587          .4023        -.0436       .0075
 3      .2500          .4711        -.2211       .1850
 4      .4134          .5034        -.0900       .0539
 5      .1729          .2967        -.1238       .0877
 6      .3884          .4126        -.0242       .0119
 7      .2601          .2517         .0084       .0445
 8      .3171          .2702         .0469       .0830

$$S^2 = \frac{1}{7} \sum_{i=1}^{8} (d_i - \bar{D})^2 = \frac{.08344}{7} = .012635$$

$$t = \frac{\sqrt{n}\,\bar{D}}{S} = \frac{\sqrt{8}\,(-.0361)}{.1124}$$

$$t = -.9084$$
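
The final step of the calculation can be checked directly. This is a small sketch with hypothetical function names: `paired_t` computes the statistic from a list of differences using the definitions of this section, and `t_from_summary` simply substitutes the summary values quoted in Table XII.

```python
from math import sqrt

def paired_t(differences):
    """t = sqrt(n) * mean / S, with S^2 the (n - 1)-divisor sample variance."""
    n = len(differences)
    mean = sum(differences) / n
    s2 = sum((d - mean) ** 2 for d in differences) / (n - 1)
    return sqrt(n) * mean / sqrt(s2)

def t_from_summary(n, mean_diff, s):
    """Same statistic computed from already-summarised values."""
    return sqrt(n) * mean_diff / s

# Summary values as quoted in Table XII: n = 8, mean = -.0361, S = .1124.
t_reported = t_from_summary(8, -0.0361, 0.1124)
print(round(t_reported, 4))  # agrees with the quoted t = -.9084
```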

B. KOLMOGOROV-SMIRNOV TEST 

Another way of comparing the differences in the samples is by the
Kolmogorov-Smirnov Test. The relative merits of the test have been
discussed, but these data present an opportunity to compare the exact
$p_0(x)$ for the Kolmogorov-Smirnov Test to the previously used
approximation of $p_0(x)$. Using the data given in Table XII, the
maximum deviation is .25, and by the previous approximation to $p_0(x)$
the corresponding P value is .9639 whereas by Massey's exact computation
[18] the P value is .6602. This very large difference indicates the
dangers in using this approximation to $p_0(x)$.
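
The maximum deviation and the approximate P value can be reproduced from the two D.M. columns of Table XII. This is a sketch only: it assumes that the approximation referred to is the usual asymptotic Kolmogorov series, an assumption that reproduces the quoted .9639. The function names are hypothetical, and the exact small-sample computation is not attempted.

```python
from math import exp, sqrt

# D.M. columns of Table XII.
real = [.4865, .3587, .2500, .4134, .1729, .3884, .2601, .3171]
model = [.3478, .4023, .4711, .5034, .2967, .4126, .2517, .2702]

def ks_statistic(a, b):
    """Largest gap between the two empirical distribution functions."""
    points = sorted(set(a) | set(b))
    return max(abs(sum(v <= t for v in a) / len(a)
                   - sum(v <= t for v in b) / len(b)) for t in points)

def ks_asymptotic_p(d, n, m, terms=100):
    """Asymptotic series 2 * sum (-1)^(k-1) exp(-2 k^2 lambda^2), d > 0."""
    lam = d * sqrt(n * m / (n + m))
    return 2 * sum((-1) ** (k - 1) * exp(-2 * k * k * lam * lam)
                   for k in range(1, terms + 1))

d = ks_statistic(real, model)
p_approx = ks_asymptotic_p(d, len(real), len(model))
print(d, round(p_approx, 4))
```

With only eight observations per sample the asymptotic series is being used far outside the range where it is accurate, which is consistent with the large gap between the two P values reported above.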

C. WILCOXON TEST WITH NORMAL APPROXIMATION

The Wilcoxon Test when performed on the data of Table XII and with
use of the normal approximation to $p_0(x)$ yields a P value of .46. The
relative merits of this test are the same as described previously, and
the results are given only for comparative purposes.
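
A sketch of this calculation follows. It assumes that the test meant is the two-sample rank-sum test applied to the two D.M. columns of Table XII, with a two-sided normal approximation and no continuity correction; this reading is an assumption, chosen because it reproduces the quoted .46. The function name is hypothetical and the code assumes there are no tied observations.

```python
from math import erfc, sqrt

# D.M. columns of Table XII.
real = [.4865, .3587, .2500, .4134, .1729, .3884, .2601, .3171]
model = [.3478, .4023, .4711, .5034, .2967, .4126, .2517, .2702]

def rank_sum_normal_p(a, b):
    """Two-sided P value for the rank sum of sample a using the normal
    approximation (no continuity correction, no ties assumed)."""
    pooled = sorted(a + b)
    w = sum(pooled.index(v) + 1 for v in a)   # rank sum of sample a
    n, m = len(a), len(b)
    mean = n * (n + m + 1) / 2                # null mean of the rank sum
    var = n * m * (n + m + 1) / 12            # null variance of the rank sum
    z = (w - mean) / sqrt(var)
    return erfc(abs(z) / sqrt(2))             # equals 2 * (1 - Phi(|z|))

p_wilcoxon = rank_sum_normal_p(real, model)
print(round(p_wilcoxon, 2))
```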

D. DAVISSON TEST WITH DEPENDENCE

In the past three tests the sample size was eight due to the fact 
that independence among samples was required by each test. What if 
one wanted to compare the average detection modulus of each buoy on 
each run or perhaps the average detection modulus in each five mile 
range band from -50 to 50 miles for each buoy on each run? In these 
cases and the many others that might be considered the values are 
dependent on each other and thus none of the assumptions of the tests 
mentioned so far are completely satisfied. 

Davisson has shown that, by comparing certain differences between the
real world and model results, the maximum likelihood ratio yields a
statistic with a known distribution [11].

Since the test is very tedious only a relatively short comparison 
with the aircraft data will be given. Consider the detection moduli 
of buoys 3, 4, and 5 on each run. The random variable to be tested is 
the average detection modulus of each buoy. Thus the null hypothesis 
is that the average detection moduli for buoys 3, 4, and 5 are the same 
in the real world as they are in the model and that their interdependence 
is also identical. 

The first step in determining the test statistic is to compute the
variance-covariance matrix of the model's average detection moduli.
To do this the average for each buoy over all runs is subtracted from
the average detection modulus for each run on that buoy.

The results are shown in Tables XIII and XIV.



TABLE XIII

AVERAGE BUOY DETECTION MODULI FOR THE MODEL AND THEIR AVERAGES

                        BUOY
RUN            3            4            5
 1          .240755      .560000      .419032
 2          .365082      .457377      .595555
 3          .104921      .596825      .514203
 4          .584210      .941052      .782857
 5          .051273      .164909      .158269
 6          .496562      .357031      .181154
 7          .038730      .465397      .579434
 8          .209259      .563148      .513929
AVERAGE     .261349      .513217      .468054

TABLE XIV

AVERAGE MODEL DETECTION MODULI MINUS THEIR AVERAGES

                        BUOY
RUN            3            4            5
 1         -.020594      .046783     -.049022
 2          .103733     -.055840      .127501
 3         -.156428      .083608      .046149
 4          .322862      .427835      .314803
 5         -.210076     -.348308     -.309785
 6          .235214     -.156186     -.286900
 7         -.222619     -.047820      .111380
 8         -.052090      .049931      .045875


The transpose of the 8x3 matrix in Table XIV, when multiplied by the
matrix itself and divided by the number of runs, yields the
variance-covariance matrix Q. The Q matrix is shown in Table XV.
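
The construction of Q can be sketched as follows, using the deviations of Table XIV. The divisor of eight (the number of runs) is inferred from the tabulated values rather than stated in the original text, so it should be read as an assumption; the function name is hypothetical.

```python
# Deviations from Table XIV: one row per run, columns are buoys 3, 4, 5.
deviations = [
    [-.020594,  .046783, -.049022],
    [ .103733, -.055840,  .127501],
    [-.156428,  .083608,  .046149],
    [ .322862,  .427835,  .314803],
    [-.210076, -.348308, -.309785],
    [ .235214, -.156186, -.286900],
    [-.222619, -.047820,  .111380],
    [-.052090,  .049931,  .045875],
]

def covariance_matrix(rows):
    """X'X / n for a matrix X whose columns are already centred."""
    n, k = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / n for j in range(k)]
            for i in range(k)]

Q = covariance_matrix(deviations)
for row in Q:
    print([round(v, 6) for v in row])
```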



TABLE XV

VARIANCE-COVARIANCE MATRIX Q

     .036453      .020347      .0098831
     .020347      .043229      .034850
     .0098831     .034850      .039085

Now a difference vector M is computed with each component being equal
to the difference between the average real world detection modulus
over all eight runs and the corresponding result from the model.

TABLE XVI

DIFFERENCE VECTOR M

          REAL WORLD     MODEL
BUOY      AVERAGE        AVERAGE          M
  3       .449466        .261349        .1881
  4       .365698        .513217       -.1475
  5       .539570        .468054        .0715



Davisson has stated [11] that the distribution of

$$M^{T} Q^{-1} M$$

is asymptotically chi-squared with N degrees of freedom where N is the
dimension of Q. In this case $M^{T} Q^{-1} M$ is 9.7682 and the
corresponding P value is .02.
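
The quadratic form and its P value can be checked with a few lines of code. This is a sketch only: it avoids an explicit matrix inverse by solving Q z = M, and it uses a closed-form upper tail that is valid for exactly three degrees of freedom. The helper names `solve` and `chi2_sf_3df` are hypothetical.

```python
from math import erfc, exp, pi, sqrt

Q = [[.036453, .020347, .0098831],
     [.020347, .043229, .034850],
     [.0098831, .034850, .039085]]
M = [.1881, -.1475, .0715]

def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - tail) / aug[i][i]
    return z

def chi2_sf_3df(x):
    """Upper-tail probability of chi-squared with exactly 3 degrees of freedom."""
    return erfc(sqrt(x / 2)) + sqrt(2 * x / pi) * exp(-x / 2)

statistic = sum(m_i * z_i for m_i, z_i in zip(M, solve(Q, M)))
p_value = chi2_sf_3df(statistic)
print(round(statistic, 3), round(p_value, 3))
```

Solving the linear system rather than inverting Q is one way of limiting the loss of accuracy that the text warns about; for larger or nearly singular Q a library routine would be the safer choice.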

As was the case with the Paired t Test, this test has the advantages
of being parametric but the disadvantages of its asymptotic properties
and lack of knowledge of $p_1(x)$. The main drawback of the Davisson
Test is its computational difficulty. As the dimension of Q increases a
large computer becomes necessary and the sorting of data becomes quite
tedious. Care must also be taken that accuracy is not lost in the
inversion of Q and that subsets are chosen such that Q is not singular.
In spite of all these disadvantages, the relief from the independence
assumption is very advantageous. If tolerance of its assumptions
permits its use, the Davisson Test will yield a more detailed
validation test. It is now possible to reject part of the model while
accepting the rest, thus allowing trouble-shooting for the simulation
analysts. This feature was also possible with the submarine model but
only because 10 different inputs were sampled and thus data collection
had to be more extensive and also more costly.

E. SUMMARY OF TESTS WITH AIRCRAFT DATA 

The P values corresponding to each of the four tests applied to the 
aircraft data are listed in Table XVII. 

TABLE XVII

SUMMARY OF P VALUES USING PARAMETRIC AND
NONPARAMETRIC TESTS ON THE AIRCRAFT MODEL DATA

               KOLMOGOROV-SMIRNOV
PAIRED t       EXACT      APPROX.      WILCOXON     DAVISSON
  .42          .6602      .9639          .46          .02


It is not appropriate to compare the results of the Davisson Test
to those of the other tests due to its unique properties, nor is it
feasible to pass judgement on the remaining tests solely on the results
in Table XVII. It should be noted however that the Kolmogorov-Smirnov
and Wilcoxon Test results are based on exact knowledge of $p_0(x)$ while
the Paired t Test and the approximate Kolmogorov-Smirnov Test are not,
and that no additional knowledge of $p_1(x)$ is obtained by using these
approximations. While the distribution of the Davisson Test statistic
is not exact nor is information about $p_1(x)$ available, it does allow
a more localized validation, thereby allowing "trouble-shooting" which
the other tests do not permit.

As with the submarine data, these tests are far from an exhaustive
set of all those possible. They were chosen to represent the range and
spectrum of assumptions needed to perform the validation of this type
of model with its data base.

VII. SUMMARY AND CONCLUSIONS



This paper has investigated the most salient problems of present day 
validation procedures and alleviated them by enlarging the scope of 
validation and by describing what is needed and can be expected from 
a statistical test with "validation type" data. It was shown that 
decision theory and cost analysis while present in previous validations 
received no mention, and that statistical testing with its pass or fail 
results did not allow the decision maker much flexibility. While only 
two simple decision rules and one type of decision criterion were 
presented, it became obvious that by determining P values from several 
tests and by trying to do such things as minimizing expected cost, the 
decision maker could avail himself of more information and have the 
capability to change more elements in his decision rule. 

A general methodology for the statistical testing of validation 
data was also discussed. Included in the methodology are the goals of 
a "validation test," the types of tests available with their inherent 
assumptions and properties, the need for multiple testing, and the 
pitfalls of relaxing assumptions within a test. 

It was seen that while a myriad of possible tests exists, those
having exact knowledge of $p_0(x)$ and $p_1(x)$ will be the best. But,
since $p_1(x)$ is seldom known due to the nature of the alternate
hypothesis, and calculation procedures necessitate approximations to
$p_0(x)$ in many cases, these desirable tests are not always available.
Some tests are clearly better than others, but in general, it was seen
that several tests using different assumptions should be used to
achieve the most reliable information about P and $\beta$.

In conclusion the problems of validation are analogous to those of
systems analysis and cost-effectiveness. The goal or criterion can be
defined as minimization of expected cost for a fixed level of validity,
yet the methods of exact determination are not as well defined and need
to be considered in concert instead of individually. In the past, one
of the methods was statistical testing. When used alone there existed
reasons to criticize the validations, but when used in the procedure as
presented in this paper, the validator has more flexibility and is able
to use more information from his data and other sources.

Another important advantage of this procedure is the increased 
ability to see the effects of changes in a decision rule. All that 
could be seen previously was that at a significance level of .6 the 
model was considered invalid but at a .4 level it was not. Now such 
things as the changes in a decision rule caused by refusing to accept 
a priori knowledge of the states of nature can be observed. 

So just as was done with systems analysis a new approach or way of 
looking at a problem has been proposed. This time it is to help the 
decision maker with his important and complex problems of model 
validation. 

VIII. AREAS FOR FUTURE STUDY



Since this paper represents a pilot study in the expansion of model 
validation, almost any facet of the paper could and should be expanded. 

The area of simulation theory is normally not considered an O.R. 
problem at least in the context of calibrating the model. The search 
for more nearly perfect statistical tests is also considered as second 
in importance to the development of decision rules applicable to model 
validation. 

After several decision rules have been presented then case studies 
similar to those of systems analysis will make a valuable contribution 
to the field of validation. 